DTSA 5510 Unsupervised Algorithms in Machine Learning Final Project

Mall Customers Segmentation

2023.09.17 D. Ikoma

1. Project Topic

This project shows how to perform mall customer segmentation using unsupervised machine learning algorithms. This is an unsupervised clustering problem that compares three popular algorithms: KMeans, Hierarchical clustering, and DBSCAN.

2. Data

In this section, the raw data is read, overviewed, and checked to see whether any cleaning is required (https://www.kaggle.com/datasets/shwetabh123/mall-customers) [1]. The dataset misspells the column name 'Gender' as 'Genre', so we corrected it in advance.
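As an illustrative sketch, the column fix can be applied right after loading. The rows below are toy values mimicking the Kaggle schema, not the real data:

```python
import pandas as pd

# Toy rows mimicking the Kaggle schema (values are illustrative, not real data).
df = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Genre": ["Male", "Female", "Female"],   # misspelled column in the raw CSV
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

# Correct the misspelled column name up front.
df = df.rename(columns={"Genre": "Gender"})
print(df.columns.tolist())
```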

The summary of 5 columns is:

There is one binary, categorical column: 'Gender'. We will exclude this binary column, as it is not well suited as a feature for distance-based clustering.

3. Data Cleaning

We will check for null data and perform basic data cleaning.

There are 5 columns and there is no null data in any column, so data cleaning is not necessary to handle null data.
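The null check can be sketched as follows (again on a toy frame standing in for the real data):

```python
import pandas as pd

# Toy frame standing in for the real dataset.
df = pd.DataFrame({
    "Age": [19, 21, 20],
    "Annual Income (k$)": [15, 15, 16],
    "Spending Score (1-100)": [39, 81, 6],
})

# Count missing values per column; all zeros means no null handling is needed.
null_counts = df.isnull().sum()
print(null_counts)
```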

4. Exploratory Data Analysis

In this section, we perform EDA, examining the distribution and correlation of the dataset to form targets for the models. First, let's check the age distribution by gender.

The total number of females is higher, but the average age is about the same. For males, the distribution is relatively flat with respect to age, whereas for females the frequency tends to be higher between the ages of 20 and 50. Next, we will check the annual income distribution by gender.

There are no large, characteristic differences in mean, median, or standard deviation between the distributions of Males and Females. The distribution of annual income peaks around $70k. Finally, we will check the distribution of 'Spending Score' for Males and Females.

There doesn't seem to be much difference in the distribution of 'Spending Score' between Males and Females. The mean is slightly larger for Females, with a peak at 40-45.

Next, we will check the correlation. First, we will check the correlation between 'Age' and 'Annual Income' for Males and Females.

The correlation coefficients for both Males and Females are small, and there is no linear correlation between 'Age' and 'Annual Income'. Next, we will check the correlation between 'Age' and 'Spending Score'.

There is a negative correlation between 'Age' and 'Spending Score' for both males and females, and as 'Age' increases, 'Spending Score' decreases. The correlation coefficient for Females is -0.38, which is larger in absolute value than -0.28 for Males. Finally, we will check the correlation between 'Annual Income' and 'Spending Score'.

There is no linear correlation for either Males or Females, but nonlinear structure can be seen in the scatter plots. The points appear to form five groups, which will serve as a reference when examining parameters for unsupervised learning.

We have summarized the distribution and correlation for each feature, 'Age', 'Annual Income', and 'Spending Score', for Males and Females. Overall, there was no difference in trends between males and females, and no outliers. Also, clusters could be confirmed in the scatter plot of 'Annual Income' and 'Spending Score'.

5. Models

The problem in this project is an unsupervised clustering problem. We compare and verify three models: KMeans, Hierarchical clustering, and DBSCAN. The binary column 'Gender' is excluded from the analysis because it is not suitable as a feature for distance-based clustering and because the EDA results showed no difference in trends between Males and Females.

5-1. Feature scaling

We will perform feature scaling before modeling. The targets are 3 features: 'Age', 'Annual Income', and 'Spending Score'.

We applied the StandardScaler to each feature and confirmed that the transformed means are 0 and the standard deviations are 1.
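A minimal sketch of this scaling step (the feature matrix below is an illustrative stand-in with the same three columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: columns are Age, Annual Income, Spending Score.
X = np.array([[19, 15, 39],
              [35, 60, 50],
              [50, 120, 20],
              [28, 80, 90]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```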

5-2. KMeans clustering

First, we will apply KMeans clustering. We set the number of clusters k as a hyperparameter and check the elbow plot.
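The elbow plot is built by fitting KMeans over a range of k and recording the inertia; a minimal sketch on synthetic stand-in data (4 blobs in 3 features, mimicking the scaled dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled 3-feature data (4 well-separated blobs).
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

# Inertia (within-cluster sum of squares) for each candidate k.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# The elbow is where the decrease in inertia flattens out.
print([round(i, 1) for i in inertias])
```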

From the above elbow plot, the elbow can be read at k=4. Next, we will check the clusters for k=4 using scatter plots.

In the diagram on the left, Label=0 (green) and Label=2 (blue) overlap. Also, in the figure on the right, the orange and red clusters overlap. In the figure on the right, k=5 looks appropriate, so let's visualize the case of k=5 as well.

In the figure on the right, the overlap between Label=2 (blue) and Label=4 (brown) has not been resolved, so there seems to be no advantage in changing from k=4 to k=5. Next, for k=4, we will check the clustering in a 3D plot.

Checking the 3D plot, we can see that the clusters that appeared to overlap in the 2D scatter plot are clearly separated.

Finally, the characteristics of each cluster are summarized as follows.

Cluster (Label) | Age  | Annual Income | Spending Score
Cluster 0       | -    | high          | low
Cluster 1       | high | low           | low
Cluster 2       | low  | high          | high
Cluster 3       | low  | low           | high
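The high/low characterization above can be derived by averaging each feature per cluster; a sketch on synthetic stand-in data, with column names matching the report's features:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled 3-feature data.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
X = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Per-cluster feature means make the high/low characterization explicit.
df = pd.DataFrame(X, columns=["Age", "Annual Income", "Spending Score"])
df["Cluster"] = labels
profile = df.groupby("Cluster").mean()
print(profile.round(2))
```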

5-3. Hierarchical clustering

Next, we will perform hierarchical clustering. The model uses scikit-learn's AgglomerativeClustering.

The height at which the dendrogram is cut can be regarded as a hyperparameter; similar to KMeans, cutting at a height (inter-cluster distance) that yields k=4 gives a good classification.
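A minimal sketch of this step, assuming Ward linkage (the scikit-learn estimator cuts directly at k=4, while the SciPy linkage matrix is what a dendrogram plot would be drawn from):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled 3-feature data.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)

# scikit-learn model: cut directly at k=4 clusters (Ward linkage).
labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# SciPy linkage matrix; cutting it at 4 clusters mirrors the dendrogram cut.
Z = linkage(X, method="ward")
labels_scipy = fcluster(Z, t=4, criterion="maxclust")
print(len(set(labels)), len(set(labels_scipy)))
```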

Hierarchical clustering also achieved classification similar to KMeans.

5-4. DBSCAN

As the third model, we apply DBSCAN. Since DBSCAN was not covered in the course, we provide an overview of the model in this section. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise and is one of the clustering algorithms implemented in the scikit-learn library [2]. As the name suggests, the core idea of DBSCAN is the concept of dense regions: the assumption is that natural clusters are composed of densely located points. This requires a definition of "dense region", for which the algorithm takes two parameters: Eps (ε), a distance threshold, and MinPts, the minimum number of points within distance Eps. Optionally, the distance metric can be specified by the user, but Euclidean distance is the usual default (as in scikit-learn). A point with at least MinPts points within distance Eps is a core point, and connected core points form a dense region. Points within Eps of a core point but without enough neighbors of their own are treated as "border points". The remaining points are noise, i.e. outliers.
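The core/border/noise behavior can be seen on a toy example (the 2-D points below are illustrative, chosen so that two tight groups and one isolated point are obvious):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point (illustrative 2-D data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
              [10.0, 10.0]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Label -1 marks noise/outliers; core samples are listed separately.
print(db.labels_)                 # → [ 0  0  0  1  1  1 -1]
print(db.core_sample_indices_)    # → [0 1 2 3 4 5]
```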

Pros

- The number of clusters does not need to be specified in advance.
- Clusters of arbitrary (non-convex) shape can be found.
- Outliers are explicitly identified as noise rather than forced into a cluster.

Cons

- Results are sensitive to the choice of Eps and MinPts.
- Clusters with widely varying densities are hard to separate with a single Eps.
- Density estimates become less meaningful in high-dimensional spaces.

We will first create a matrix of investigated hyperparameter combinations.

The heatmap below shows how many clusters the DBSCAN algorithm generated for each parameter combination.

The heatmap above shows that the number of clusters varies from 3 to 7. To decide which combination to choose, we will use the silhouette score as a metric and plot it as a heatmap as well.
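The grid search over eps and min_samples can be sketched as follows (synthetic stand-in data; the actual grid used in the report may differ):

```python
import math

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled 3-feature data.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
X = StandardScaler().fit_transform(X)

results = {}
for eps in [0.3, 0.4, 0.5, 0.6]:
    for min_samples in [5, 10, 15]:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        # Number of clusters, excluding the noise label -1.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        # The silhouette score needs at least 2 clusters to be defined.
        score = silhouette_score(X, labels) if n_clusters >= 2 else math.nan
        results[(eps, min_samples)] = (n_clusters, score)

# Pick the combination with the highest silhouette score.
best = max((k for k, v in results.items() if not math.isnan(v[1])),
           key=lambda k: results[k][1])
print(best, results[best])
```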

The global maximum is 0.29 at eps=0.6 and min_samples=10, leading to 5 clusters.

Finally, we will draw a scatter plot. DBSCAN identifies outliers as Label=-1, so outliers are also displayed in the scatter plot.

DBSCAN recognizes areas with low 'Spending Score' as outliers, so in this case the identification was not successful. We tried checking the scatter plot with some other hyperparameters, but we could not separate the areas with low 'Spending Score' well.

6. Results and Discussion

First, let's summarize processing time. (Please note that processing time may vary slightly from trial to trial.)

Model                   | Processing time
KMeans                  | 32.763 msec
Hierarchical clustering | 5.064 msec
DBSCAN                  | 4.007 msec
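A sketch of how such timings can be measured with time.perf_counter (synthetic stand-in data; absolute values will differ from the table above):

```python
import time

from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled 3-feature data.
X, _ = make_blobs(n_samples=200, centers=4, n_features=3, random_state=0)
X = StandardScaler().fit_transform(X)

models = {
    "KMeans": KMeans(n_clusters=4, n_init=10, random_state=0),
    "Hierarchical clustering": AgglomerativeClustering(n_clusters=4),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=10),
}

timings = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X)
    timings[name] = (time.perf_counter() - start) * 1000  # msec
    print(f"{name}: {timings[name]:.3f} msec")
```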

In terms of computational load, Hierarchical clustering and DBSCAN are comparably fast, while KMeans takes over six times longer. Next, we will compare the discrimination performance of the three models using scatter plots.

KMeans clustering and Hierarchical clustering show almost the same classification results, although there are some differences in details. On the other hand, DBSCAN recognizes a lot of data as outliers and cannot classify them well. Considering the calculation time and classification performance, hierarchical clustering is considered to be the most suitable for this data.

Finally, the characteristics of each cluster are summarized as follows.

Cluster (Label) | Age  | Annual Income | Spending Score
Cluster 0       | high | low           | low
Cluster 1       | low  | low           | high
Cluster 2       | low  | high          | high
Cluster 3       | mid  | high          | low

7. Conclusion

The conclusions of this project are summarized below.

As future work, we need to dig deeper into the reasons why DBSCAN's classification was unsuccessful and to consider additional metrics and hyperparameters.

References

[1] Dataset: https://www.kaggle.com/datasets/shwetabh123/mall-customers.

[2] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of KDD, pp. 226–231, 1996.

[3] GitHub repository, https://github.com/DaisakuIkoma/CU_MSDS_USML.